Abstract

Deep Convolutional Neural Networks (CNNs) have demonstrated excellent
performance in image classification, but still show room for
improvement in object-detection tasks with many categories, in
particular for cluttered scenes and occlusion. Modern detection
algorithms like Regions with CNNs (Girshick et al., 2014) rely on
Selective Search (Uijlings et al., 2013) to propose regions that
represent objects with high probability; CNNs are then deployed to
classify these regions. Selective Search represents a family of
sophisticated algorithms that are engineered with multiple
segmentation, appearance and saliency cues, and typically come with a
significant run-time overhead. Furthermore, (Hosang et al., 2014) have
shown that most methods suffer from low reproducibility due to
unstable superpixels, even for slight image perturbations. Although
CNNs are subsequently used for classification in top-performing
object-detection pipelines, current proposal methods are agnostic to
how these models parse objects and to their rich learned
representations. As a result, they may propose regions that do not
resemble high-level objects, or miss some objects entirely. To
overcome these drawbacks we propose a boosting approach which directly
takes advantage of hierarchical CNN features for fast detection of
regions of interest. We demonstrate its performance on the ImageNet
2013 detection benchmark and compare it with state-of-the-art methods.


The copyright of this document resides with its authors. It may be 
distributed unchanged freely in print or electronic forms. 


1. Introduction 

Visual object detection is at the heart of many applications in
science and engineering, ranging from microscopic scales in medicine
to macroscopic scales in space exploration. Historically these
problems have mostly been formulated as classification problems in a
sliding-window framework. To this end a classifier is trained to
differentiate one or more object classes from a background class by
sliding a window over the whole image with a given stride and on
multiple scales and aspect ratios, predicting the class label for
every single window. In practice, this demands the classification of
more than a million windows for common images, resulting in a
trade-off between run-time performance and classification accuracy
for real-time applications.
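To give a feel for how quickly the number of windows grows, the following minimal sketch counts sliding-window positions over several scales and aspect ratios. The base window size, stride, scale step and aspect ratios are illustrative assumptions and are not taken from any particular detector.

```python
def count_windows(img_w, img_h, base=32, stride=4, n_scales=12,
                  scale_step=1.25, aspect_ratios=(0.5, 1.0, 2.0)):
    """Count sliding-window positions over all scales and aspect ratios.

    All default values are hypothetical and chosen only for illustration.
    """
    total = 0
    for s in range(n_scales):
        side = base * scale_step ** s
        for ar in aspect_ratios:
            # Keep the window area fixed per scale; ar = width / height.
            w = int(round(side * ar ** 0.5))
            h = int(round(side / ar ** 0.5))
            if w > img_w or h > img_h:
                continue  # window does not fit into the image
            nx = (img_w - w) // stride + 1  # horizontal positions
            ny = (img_h - h) // stride + 1  # vertical positions
            total += nx * ny
    return total


if __name__ == "__main__":
    # A denser stride multiplies the number of windows to classify.
    print(count_windows(500, 375, stride=4))
    print(count_windows(500, 375, stride=1))
```

The count scales roughly with the inverse square of the stride, which is why dense evaluation is only practical with a very cheap classifier.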

Recent years have seen a paradigm shift in object-detection systems,
replacing the sliding-window step with object-proposal algorithms
prior to classification (Girshick et al., 2014; Szegedy et al., 2014b;
Cinbis et al., 2013; Wang et al., 2013; He et al., 2014). These
algorithms propose regions in an image which are predicted to contain
the objects of interest with high likelihood. Such an approach not
only reduces the number of regions which have to be classified from a
million to a few thousand, but also allows for spanning a larger range
of scales and aspect ratios for regions of interest. Depending on the
execution time of the proposal method and the classifier, the
reduction of regions for classification can lead to a significantly
faster run time and allow the use of more sophisticated classifiers.

Frameworks with region-proposal methods and subsequent classification
which are based on Convolutional Neural Networks (Girshick et al.,
2014) achieve state-of-the-art performance on the ImageNet (Deng et
al., 2009) detection task. A popular pipeline consists of three main
steps: (i) several regions that are likely to be objects are generated
by a proposal algorithm; (ii) deep CNNs with multi-way Softmax or
Support Vector Machines (SVMs) on top classify these regions in order
to detect all possible classes and instances per image; (iii) finally,
an optional regression step further refines the location of the
detected objects.

A large body of work on region-proposal algorithms exists (Hosang et
al., 2014). The majority of them are engineered to merge different
segmentation, saliency and appearance cues, together with some
hierarchical scheme, with parameters tuned to specific datasets. Here
we propose a data-driven region-proposal method which is based on
features extracted directly from lower convolutional layers of a CNN.
We use a fast binary boosting framework (Appel et al., 2013) to
predict the objectness of regions, and finally deploy a regressor
which uses features from the upper convolutional layer after pooling
to refine the localization. The proposed framework achieves the top
recall rate on ImageNet 2013 detection for Intersection over Union
(IoU) localization thresholds between 50% and 65%. Finally, we apply
our proposal algorithm instead of Selective Search in the baseline
Regions-with-CNN pipeline (Girshick et al., 2014), resulting in an 8%
improvement over the state-of-the-art with a single prediction model
while achieving considerably faster test time.

In the two following subsections we briefly review region-proposal
algorithms and present the motivation behind our work. In Section 2 we
present the details of our algorithm and in Section 3 we show the
experimental study and results. In Section 4, we place our algorithm
in line with the Regions-with-CNN framework and benchmark its
performance on the ImageNet 2013 detection challenge. Finally, in
Section 5 we state our conclusions and point out directions for future
research.

1.1. Prior work 

Most currently leading frameworks use various segmentation methods and
engineered cues in hierarchical algorithms to merge smaller areas
(e.g., superpixels) into larger ones. During this hierarchical process
boxes which are likely to be objects are proposed. In turn CNNs, whose
receptive fields are rectangular, are applied as region descriptors.
This relaxes the need to perform accurate segmentation: a rectangle
around regions that are most probably objects is sufficient.

We briefly review some representative methods which are 
evaluated in detail in (Hosang et al., 2014). 

Selective Search (Uijlings et al., 2013), which is currently the most
popular algorithm, involves no learning. Its features and score
functions are carefully engineered on Pascal VOC and ILSVRC so that
low-level superpixels (Felzenszwalb & Huttenlocher, 2004) are
gradually merged to represent high-level objects in a greedy fashion.
It achieves very high localization accuracy due to the initial
over-segmentation, at the cost of a time overhead. Randomized Prim's
(Manen et al., 2013) is similar to Selective Search in terms of
features and the process of merging superpixels. However, the weights
of the merging function are learned and the whole merging process is
randomized.

Then there is a family of algorithms which invest significant time in
a good high-level segmentation. Constrained Parametric Min-Cuts (CPMC)
(Carreira & Sminchisescu, 2012) generates a set of overlapping
segments. Each proposal segment is the solution of a binary
segmentation problem. Up to 10,000 segments are generated per image,
which are subsequently ranked by objectness using a trained regressor.
(Rantalankila et al., 2014), similar in principle to (Uijlings et al.,
2013) and (Carreira & Sminchisescu, 2012), merges a large pool of
features in a hierarchical way starting from superpixels. It generates
several segments via seeds like CPMC does.

(Endres & Hoiem, 2010; 2014) combine a large set of cues and deploy a
hierarchical segmentation scheme. Additionally, they learn a regressor
to estimate boundaries between surfaces with different orientations.
They use graph cuts with different seeds and parameters to generate
diverse segments, similar to CPMC. Multiscale Combinatorial Grouping
(MCG) (Arbelaez et al., 2014) combines efficient normalized cuts and
CPMC (Carreira & Sminchisescu, 2012) and achieves competitive results
within a reasonable time budget.

Several methods which engineer objectness have been proposed in the
literature. Objectness (Alexe et al., 2012) was one of the first to be
published, although its performance is inferior to most modern
algorithms (Hosang et al., 2014). Objectness estimates a score based
on a combination of multiple cues such as saliency, colour contrast,
edge density, location and size statistics, and the overlap of
proposed regions with superpixels. (Rahtu et al., 2011) improves
Objectness by proposing new cues and combining them more effectively.

Moreover, fast algorithms with approximate detection have recently
been introduced. Binarized Normed Gradients for Objectness (BING)
(Cheng et al., 2014) is a simple linear classifier over edge features
and is used in a sliding-window manner. In stark contrast to most
other methods, BING takes on average only 0.2s per image on CPU.
EdgeBoxes (Zitnick & Dollar, 2014) is similar in spirit to BING. A
scoring function is evaluated in a sliding-window manner, with object
boundary estimates and features which are obtained via structured
decision forests.

Given that engineering an algorithm for specific data is not always a
desired strategy, there are recent region-proposal algorithms which
are data-driven. Data-driven Objectness (Kang et al., 2014) is a
practical method, where the likelihood of an image corresponding to a
scene object is estimated through comparisons with large collections
of example object patches. This method can prove very effective when
the notion of object is not well defined through topology and
appearance, such as daily activities.

A contemporary work which follows the data-driven path is Scalable,
High-quality Object Detection (Szegedy et al., 2014b). After
revisiting their Multibox algorithm (Erhan et al., 2014), the authors
are able to integrate region proposals and classification in one step
end-to-end. By deploying an ensemble of models with a robust loss
function and their newly introduced contextual features, they achieve
state-of-the-art performance on the detection task.

1.2. Motivation 

Recent top-performing approaches on ILSVRC detection are based on
hand-crafted methods for region proposals, such as Selective Search
(Uijlings et al., 2013). However, although these methods are tuned on
this benchmark, they miss several objects. For example, Selective
Search proposes on average 2,403 regions per image in (Girshick et
al., 2014) with 91.6% recall of ground truth objects (for a 0.5 IoU
threshold). In that case, even with oracle classification and
subsequent localization, more than 8% of object instances will not be
detected. This leaves significant room for improvement in future
algorithms.
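The recall figure quoted above depends on the Intersection over Union (IoU) criterion. As a minimal sketch, IoU and recall at a given threshold can be computed as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in continuous coordinates, and the function names are illustrative; the Pascal VOC evaluation code uses an inclusive integer pixel convention that differs slightly.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(ground_truth, proposals, threshold=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for g in ground_truth
               if any(iou(g, p) >= threshold for p in proposals))
    return hits / len(ground_truth)
```

A ground-truth object that no proposal covers at the chosen threshold cannot be recovered by any later stage, so the recall of the proposal step is an upper bound on the recall of the whole detection pipeline.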

Furthermore, algorithms which build on superpixels can be unstable
even for slight image perturbations, resulting in poor generalization
(Hosang et al., 2014) to different data.

Object-proposal algorithms which are based on low-level cues are
agnostic to how a learned network perceives the class of objects in
the space of natural images. A CNN which is trained in a supervised
manner to recognize 1000 object categories on ILSVRC has learned a
rich set of representations to identify local parts and hierarchically
merge them toward certain class instances at the top layers. As
opposed to segmentation-based methods, which merge segments based on
simple binary criteria such as whether or not boundaries and color
cues are present, CNN features ideally span the manifold of natural
images, which is a very small subspace inside a high-dimensional
feature space.

In practice, Selective Search is not scale-invariant. Nevertheless, it
is engineered to work well on ILSVRC and Pascal VOC data with careful
parameter tuning. To this end, Regions with CNN (Girshick et al.,
2014) resizes all images to a width of 500 pixels to serve its
purpose. However, a data-driven algorithm which is not constrained to
build on superpixels bypasses this step.

Regions with CNN uses a linear regressor after classifying the
proposed regions to better localize the bounding boxes around the
object. For this purpose they deploy pool-5 CNN features. This
technique can be applied to proposals in the first place. Of course,
class-specific regressors are applicable only after classification,
but nevertheless generic object regressors can also enhance region
localization.

All in all, the motivation of this work is to address the conceptual
discontinuity between the object-proposal method and the subsequent
classification. Using hand-crafted scores in the proposal stage and
applying a convolutional neural network for classification typically
results not only in slow run time but also in the aforementioned
instabilities. The key idea is to utilize the convolutional responses
of a network whose weights are learned to recognize different object
classes also for proposals. Finally, we formulate the problem with a
boosting framework to guarantee fast execution time.


2. Algorithm 

Detector: We deploy a binary boosting framework to learn a classifier
with desired output $y_i \in \{-1, 1\}$ for an image patch $i$,
$i \in \{1, \dots, N\}$, where $1$ stands for object and $-1$ for
background. The input samples $x_i$ are feature vectors which describe
an image patch $i$. The features $x_i$ are a selected subset of
convolutional responses $\mathrm{conv}_j^{k_j}$ from a Proposal CNN
(cf. Fig. 1), where $j$ pertains to convolutional layer
$j \in \{1, \dots, L\}$ and $k_j$ spans the number of feature maps for
this layer (e.g., AlexNet (Krizhevsky et al., 2012) uses $L = 5$ and
$k_1 \in \{1, \dots, 96\}$). Our Proposal CNN is the VGGs model from
(Chatfield et al., 2014), whose first-layer convolutional responses
are $110 \times 110$ pixels, and therefore provides double resolution
compared to AlexNet.

Aggregate-channel features from (Dollar et al., 2009) are used, where
deep convolutional responses serve the role of channels, while we
deploy a modified version of the fast setting provided by (Appel et
al., 2013). Thus, efficient AdaBoost is used to train and combine
2,048 depth-two trees over the $d \times d \times F$ candidate
features (channel pixel lookups), where $d$ is the baseline
classifier's size and $F$ is the number of convolutional responses,
i.e., the patch descriptors (e.g., the VGGs architecture has $F = 96$
kernels in the first layer). The convolutional responses from all
positive and negative patches which are extracted from the training
set are rescaled to a fixed $d \times d$ size (e.g., $d = 25$) before
they serve as input to the classifier. In practice, classifiers with
various $d$ can be trained to capture different resolutions of these
representations. At test time all classifiers are applied to the raw
image and their detections are aggregated and non-maximally suppressed
jointly.
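The idea of boosting over channel pixel lookups can be illustrated with a toy example. The sketch below flattens $F$ response maps of size $d \times d$ into a feature vector and trains discrete AdaBoost on it. It is an illustration under simplifying assumptions, not the training procedure used here: decision stumps are swapped in for the depth-two trees, the search over thresholds is exhaustive rather than the fast scheme of (Appel et al., 2013), and all function names are hypothetical.

```python
import math


def flatten_patch(channels):
    """Turn F response maps (each d x d, as lists of rows) into one vector."""
    return [v for ch in channels for row in ch for v in row]


def stump_predict(stump, x):
    """A stump compares a single channel pixel lookup against a threshold."""
    _, f, thr, pol = stump
    return pol if x[f] > thr else -pol


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted(set(x[f] for x in X)):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[f] > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def adaboost(X, y, rounds=10):
    """Discrete AdaBoost; labels y must be in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def score(model, x):
    """Objectness score: weighted vote of all weak learners."""
    return sum(alpha * stump_predict(s, x) for alpha, s in model)
```

Because each weak learner reads only one or two entries of the feature vector, evaluating the ensemble on a window costs a handful of lookups and comparisons, which is what makes dense sliding-window evaluation affordable.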

Figure 1. Proposed object detection framework based on boosting
hierarchical features. Regions which correspond to objects and
background patches are extracted from training images. Then the
convolutional responses from first layers of a Proposal CNN are used
to describe these patches. These are the input to a boosting model
which differentiates object proposals from background. Finally a
Category CNN is employed to classify the proposals from the Boosting
classifier into object categories.

Hierarchical features: In order to evaluate objectness in different
patches, we train the classifier with several positive and negative
samples, which are extracted from Pascal 2007 VOC (Everingham et al.,
2010). Positive samples are the ones that correspond to the ground
truth annotated objects, while negatives are defined as rectangular
samples randomly extracted from the training set at different scales
and aspect ratios, which have less than 0.3 Intersection over Union
overlap with the positives. For our experiments we considered patches
sampled from the validation sets of VOC 2007 and ImageNet 2013
detection datasets, since both of them are exhaustively annotated*.
Naturally there is a margin for improvement with more sophisticated
sampling, given that the VOC and ILSVRC categories do not include all
possible object classes that can appear in an image.
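The negative-sampling rule described above can be sketched with simple rejection sampling: random rectangles are drawn, and a rectangle is kept only if its IoU with every positive is below 0.3. The sampling distribution, the minimum side length and the function names are illustrative assumptions.

```python
import random


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def sample_negatives(img_w, img_h, positives, n, max_iou=0.3,
                     min_side=16, rng=None, max_tries=10000):
    """Draw up to n random boxes with IoU < max_iou to every positive."""
    rng = rng or random.Random(0)
    negatives = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        # Random width and height give varied scales and aspect ratios.
        w = rng.randint(min_side, img_w)
        h = rng.randint(min_side, img_h)
        x1 = rng.randint(0, img_w - w)
        y1 = rng.randint(0, img_h - h)
        box = (x1, y1, x1 + w, y1 + h)
        if all(iou(box, p) < max_iou for p in positives):
            negatives.append(box)
    return negatives
```

This rule is only safe when every object in the image is annotated; otherwise a sampled "negative" may in fact cover an unlabeled object.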

*This means that all object instances belonging to $C$ classes are
fully annotated ($C = 20$ for Pascal, and $C = 200$ for ImageNet
data), which prevents us from extracting supposedly negative samples
which actually correspond to objects.

In order to properly crop the objects from the convolutional responses
along the hierarchy, the image-level annotations have to be mapped to
the corresponding regions of the intermediate representations.
Therefore, the supports of pooling and convolutional kernels determine
a band within the rectangular box which has to be cropped out, so that
information outside the object's area can be safely ignored. In
practice, after two pooling stages the area that should be cropped
without including too much background information becomes very narrow.
In our experiments we consider filter responses from the first two
layers. We have found that using only kernels from the first
convolutional layer, before any spatial pooling is applied, gives the
best performance.
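One way to read the cropping rule above, for a single convolutional layer, is to keep only those output units whose receptive field lies entirely inside the annotated box. The sketch below follows this reading; the kernel size of 7, the stride of 2 and the absence of padding are illustrative assumptions, pixel coordinates are treated as inclusive, and the function names are hypothetical.

```python
import math


def crop_range(lo, hi, kernel, stride):
    """Output indices u whose support [u*stride, u*stride + kernel - 1]
    lies inside the inclusive pixel interval [lo, hi]."""
    first = math.ceil(lo / stride)
    last = math.floor((hi - kernel + 1) / stride)
    return first, last  # the range is empty when last < first


def map_box(box, kernel=7, stride=2):
    """Map an image-level box (x1, y1, x2, y2) to feature-map indices.

    Returns None when the box is too small to contain a full support.
    """
    x1, y1, x2, y2 = box
    u1, u2 = crop_range(x1, x2, kernel, stride)
    v1, v2 = crop_range(y1, y2, kernel, stride)
    if u2 < u1 or v2 < v1:
        return None
    return (u1, v1, u2, v2)
```

Each further convolution or pooling stage widens the support of a unit, so applying the same rule layer after layer shrinks the usable region, which is consistent with the observation that the crop becomes very narrow after two pooling stages.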

Most first-layer kernels resemble anisotropic Gaussians and color
blobs (cf. Fig. 1). As a matter of fact our method relates to BING and
EdgeBoxes, which use edge features, and to methods that use scores
which account for color similarity (e.g., Selective Search). By
inspecting the convolutional responses of the first layer, we observe
that some kernels are able to capture the texture of certain objects
and thus disentangle them from the background and the other objects
(such as the first and third kernels in Fig. 2). This family of
features provides quick detection, as the time-consuming high-level
segmentation is avoided. However, it has the drawback that local
information from lower layers cannot naturally compose non-rigid
high-level objects. Nevertheless, a subsequent regression step which
leverages features from the upper convolutional layers can help with
these cases. This is described in the Bounding-Box Regression
paragraph below.

Figure 2. An image from Pascal VOC (Everingham et al., 2010) and its
convolutional responses with representative first-layer filters. In
order to classify object candidates a binary boosting framework is
trained with positive (green) and negative (red) samples which are
extracted from the CNN's lower layers.


Testing: In order to detect objects in previously unseen images we
apply the learned classifier densely in a sliding-window manner at $S$
different scales and $R$ aspect ratios. We typically sample $S = 12$
scales and $R = 3$ aspect ratios. Non-maximum suppression (NMS) is
used to reject detections with more than $U$ Intersection over Union
overlap for every (scale, aspect ratio) combination. Finally, after
detections from all scales and aspect ratios are aggregated, another
joint NMS with $V$ IoU is applied. We experimented with different
parameters and we use $U = 63\%$ and $V = 90\%$ in our reported
results in Fig. 3.

Bounding-Box Regression: After extracting the proposals per image, a
subsequent regression step can be deployed to refine their
localization. As proposed by (Girshick et al., 2014), a linear
regressor is used with regularization constant $\lambda = 1{,}000$.
For training we use all ground truth annotations $G^i$ and our best
detection $P^i$ per ground truth for all training images
$i \in \{1, \dots, N\}$ from Pascal VOC 2007. The best detection is
defined as the one with the highest overlap with the ground truth. We
discard pairs with less than 70% IoU overlap. The goal of the
regressor is to learn how to shift the locations of $P$ towards $G$
given the description $\phi$ of the detected bounding box. The
transformations are modeled as linear functions of pool5 features,
which are obtained by forward propagating the $P$ regions through the
Proposal CNN.
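For reference, the box parametrization of (Girshick et al., 2014) expresses the shift from a proposal $P$ to a ground truth $G$ as a scale-invariant translation of the center and a log-space change of width and height. The sketch below computes these regression targets and applies predicted offsets back to a proposal. That the same parametrization is used here is an assumption based on the citation above; the ridge-regression fit on pool5 features is not shown, and the function names are hypothetical.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) that move proposal P onto ground truth G."""
    px, py, pw, ph = to_center(P)
    gx, gy, gw, gh = to_center(G)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_deltas(P, d):
    """Apply predicted offsets d = (d_x, d_y, d_w, d_h) to proposal P."""
    px, py, pw, ph = to_center(P)
    cx, cy = pw * d[0] + px, ph * d[1] + py
    w, h = pw * math.exp(d[2]), ph * math.exp(d[3])
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

Applying the exact targets to a proposal recovers the ground-truth box, so the quality of the refinement depends entirely on how well the linear model predicts these four numbers from the features.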


3. Experiments 

In order to test the efficacy and performance of our algorithm we
performed experiments on data from the ImageNet 2013 detection
challenge† (Deng et al., 2009). We follow the approach proposed in the
review paper of (Hosang et al., 2014) and we report the obtained
performance vs. localization accuracy (Fig. 3) and number of
candidates per image (Fig. 4). Specifically, we calculate the recall
of ground truth objects for various localization thresholds using the
Intersection-over-Union (IoU) criterion, which is the standard metric
on Pascal VOC. In Fig. 3 we demonstrate our performance compared to
state-of-the-art methods and three baselines, as they have been
evaluated by (Hosang et al., 2014). Each algorithm is allowed to
propose up to 10,000 regions per image on average. The methods are
sorted based on the Area-Under-the-Curve (AUC) metric, while in
parentheses is the average number of proposed regions per image.
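The recall curve and its AUC can be computed from the best IoU that any proposal achieves for each ground-truth object. The following is a minimal sketch; the threshold grid and the trapezoidal integration are assumptions, and the exact evaluation protocol is the one of (Hosang et al., 2014).

```python
def recall_curve(best_ious, thresholds):
    """Recall at each IoU threshold.

    best_ious[i] is the highest IoU between ground-truth object i and
    any proposal in its image.
    """
    n = len(best_ious)
    return [sum(1 for v in best_ious if v >= t) / n for t in thresholds]


def auc(thresholds, recalls):
    """Trapezoidal area under the recall-versus-IoU-threshold curve."""
    area = 0.0
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * (recalls[i] + recalls[i - 1]) * width
    return area
```

A method with high recall at loose thresholds but poor localization has a curve that drops quickly, so its AUC can be lower than that of a method with slightly lower recall at 0.5 IoU but tighter boxes.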

Recall vs. localization: Our method belongs to the family of
algorithms with fast and approximate object detection, such as BING
and EdgeBoxes. These algorithms provide a higher recall rate but
poorer localization as opposed to methods that use high-level
segmentation cues like Selective Search. The latter are considerably
slower but more accurate in localizing the objects due to boundary
information. In Table 1 we provide the recall rate for varying
localization accuracy via the IoU criterion. Our method provides the
highest recall until around 65% IoU overlap. We also provide in Fig. 3
and Table 1 the gain in performance when we jointly use Selective
Search and our method while still constraining the number of proposals
to be less than 10,000. In that case the benefit is mutual, as
Selective Search provides better localization, while our algorithm
provides higher recall, i.e., a higher retrieval rate of ground truth
objects for localization accuracy less than 65% IoU. A small subset of
images have been blacklisted in the evaluation process per ILSVRC
policy.

†As opposed to the training set for both the boosting framework and
the regressor, which is Pascal 2007 VOC.

Figure 3. Proposals quality on ImageNet 2013 validation set when at
most 10,000 regions are proposed per image. On recall versus IoU
threshold curves, the number indicates area under the curve (AUC), and
the number in parenthesis is the obtained average number of proposals
per image. Statistics of comparison methods come from (Hosang et al.,
2014). Our curves are drawn dashed.

Recall (%) for various IoU thresholds

Method                                        IoU > 0.5   IoU > 0.65   IoU > 0.8   Testing time (s)
Selective Search (Uijlings et al., 2013)      94.6        89.0         72.2*       10
Randomized Prim (Manen et al., 2013)          92.3        82.0         61.2        1
MCG (Arbelaez et al., 2014)                   91.8        81.0         60.6        30
Edge Boxes (Zitnick & Dollar, 2014)           93.1        86.6         49.7        0.3
Boosting Convolutional Features               98.1*       89.4*        38.7        2
Endres 2010 (Endres & Hoiem, 2010)            81.1        67.7         46.4        100
BING (Cheng et al., 2014)                     95.5        43.0         7.2         0.2
Boosting Conv Features and Selective Search   97.7*       91.9*        75.3*       12
Gaussian                                      85.3        72.5         51.3        0
Sliding window                                90.8        56.9         14.1        0
Superpixels                                   51.3        26.7         10.0        1

Table 1. A comparison of our method with various category-independent
object detectors on the validation set of ImageNet 2013 detection
data. We compare in terms of recall of ground truth object annotations
for various overlap thresholds. In order to be consistent with the
literature, the strict Intersection-over-Union (IoU) Pascal VOC
criterion is used. The methods are sorted according to the AUC,
similarly to Fig. 3. The top-2 methods per IoU threshold are marked
with an asterisk. Representative testing times are shown in the last
column.

Time analysis: The test-time complexity is linear in each of the
number of deployed classifiers, scales, aspect ratios, and the image
size, when the other parameters are held constant. In the last column
of Table 1 an estimate of the average testing time is shown for our
framework and for the others, as the latter have been evaluated by
(Hosang et al., 2014). Our algorithm is quite efficient in both
training and testing. More specifically, extracting convolutional
responses for the validation image set of the ImageNet 2013 detection
benchmark takes only a few minutes with Caffe (Jia et al., 2014) on a
modern machine (e.g., testing one image with a single K40 GPU takes
2ms), mainly because of the time needed to save the features. However,
this can be done offline for popular datasets. Training a modified
version of the boosting framework (Appel et al., 2013), which is now
part of Dollar's Matlab toolbox (Dollar, 2013), on Pascal VOC 2007
data (train-val and test, i.e., 9,963 images with 24,640 annotated
objects) with a high-end multi-core CPU takes about three hours. This
consists of training on all positives and 20k negatives, and
additionally three rounds of bootstrapping, where at each round 20k
more negatives are extracted among the classifier's false positives.
The training time increases for larger values of the baseline
detector's size, such as $d = 40$, but this does not affect the
testing time.

At test time, the learned classifier is applied densely in
sliding-window fashion on 20,121 validation images from ImageNet
detection. All possible windows in $S = 12$ different scales and
$R = 3$ different aspect ratios are evaluated by the classifier, which
outputs for each window its confidence to be an object. Greedy
non-maximal suppression is performed, where bounding boxes are
processed in order of decreasing confidence, and once a box is
suppressed it can no longer suppress other boxes. Separate and joint
NMS are deployed with $U = 63\%$ and $V = 90\%$ IoU thresholds,
respectively. Testing on a multi-core CPU takes about 2s per image.
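The two-stage greedy suppression described above can be sketched as follows. Detections are assumed to be `(score, box)` pairs grouped by a (scale, aspect ratio) key; the data layout and function names are illustrative and do not reproduce the toolbox implementation.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(dets, max_iou):
    """Greedy NMS over a list of (score, box) detections.

    Boxes are visited in order of decreasing score. A box is compared
    only against boxes that were kept, so a suppressed box never
    suppresses another box.
    """
    kept = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= max_iou for k in kept):
            kept.append((score, box))
    return kept


def two_stage_nms(dets_by_group, u=0.63, v=0.90):
    """Separate NMS per (scale, aspect ratio) group, then one joint NMS."""
    pooled = []
    for dets in dets_by_group.values():
        pooled.extend(greedy_nms(dets, u))
    return greedy_nms(pooled, v)
```

The joint threshold is much looser than the per-group threshold, so the second pass removes only near-duplicate boxes that different scales or aspect ratios produced for the same location.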

In Fig. 4 we show performance comparisons in terms of recall at a
minimum of 50% IoU for different numbers of proposed regions. Our
scheme is the most effective when at least 1,000 regions are proposed.
For a smaller number of proposals the performance degrades fast. This
is mainly because of the significant non-maximal suppression that is
applied to reduce the number of proposals, while starting from a large
number of scales and aspect ratios. In practice a more sophisticated
design for fewer proposals can improve the recall rate further,
especially in the range of 100 to 1,000 regions. For a small number of
candidates an alternative strategy could be to deploy regression to
cluster neighboring regions into unique detections, instead of using
non-maximum suppression.

4. ImageNet detection challenge 

In order to investigate in practice how the recall-localization
trade-off affects the effectiveness of the object candidates on a
detection task, we evaluate the overall detection performance on the
ImageNet 2013 benchmark. We introduce our algorithm into the
Regions-with-CNN detection framework (Girshick et al., 2014) by
replacing Selective Search (Uijlings et al., 2013) and proposing our
regions instead at the first step.

In Table 2 we show the mean and median average precision on a subset
of the validation set. We use the {val1, val2} split as was performed
by (Girshick et al., 2014). We use their pretrained CNN and SVM models
as the Category CNN, which are trained on {val1, train1k}, i.e., 9,887
validation images and 1,000 ground truth positives per class from the
classification set. The deployed Proposal CNN is the VGGs model from
(Chatfield et al., 2014), which is pretrained on the ILSVRC2012
classification dataset. Given that Selective Search is not
scale-invariant, all images are rescaled to have 900 pixels width
while preserving the aspect ratio. Thus, Selective Search proposes on
average 5,826 regions per image. In (Girshick et al., 2014) all images
are rescaled to have 500 pixels width, which yields 29.7 and 29.2 mean
and median AP for 2,403 regions on average. For our method we used the
model that we demonstrate in Figs. 3 and 4, which generates 9,927
proposals on average.

Figure 4. Proposals quality on ImageNet 2013 validation set in terms
of detected objects with at least 50% IoU for various average number
of candidates per image. Compared to all other methods from (Hosang et
al., 2014), our method is the most effective in terms of ground truth
object retrieval when at least 1,000 regions are proposed and accurate
localization is not a major concern.

Average Precision (AP)

Method                 Mean AP   Median AP
Selective Search       31.5      30.2
Boosting Deep Feats    34.0      32.5

Table 2. Mean and median average precision on ImageNet 2013 detection
task. We deploy the state-of-the-art Regions with Convolutional Neural
Networks (R-CNN). Comparison when regions are proposed by Selective
Search and our method, respectively. There is no post-processing
regression step.

Our improvement is a product of two factors: first, higher recall of
ground truth objects within roughly the 50% to 70% IoU threshold range
and, second, a larger number of proposals. Coarse localization is
corrected to some extent by subsequent steps of R-CNN due to the
robustness of convolutional neural networks to object location and
partial occlusion. Further improvement is expected if class-specific
regression is introduced on top of prediction, so that the detected
bounding boxes are better located around the objects, which could
prove to be especially helpful in our algorithm given that we leverage
no boundary/segmentation cues. Finally, an ensemble of models along
with more sophisticated architectures (e.g., GoogLeNet (Szegedy et
al., 2014a), MSRA PReLU-nets (He et al., 2015), very-deep nets
(Simonyan & Zisserman, 2014), etc.) could further improve our results.

5. Discussion 

Features learned in convolutional neural networks have been proven to
be very discriminative in recognizing objects in natural images by
mapping them onto small manifolds of a much higher-dimensional feature
space. The boosting classifier learns to map image data points to the
union of all these low-dimensional manifolds. We hypothesize that it
is still a relatively small subspace, which preserves the notion of
object and includes most instances that can be found in popular visual
datasets such as Pascal VOC and ILSVRC.

In this paper we propose a framework which is able to benefit from the
hierarchical and data-adaptive features extracted from a convolutional
neural network, as well as from the quick training and test time of
state-of-the-art boosting algorithms. There are many directions to
explore this idea further: in this paper the feature responses
extracted from different layers have equal weight during training, but
exploiting the hierarchy in a top-down fashion might result in faster
and more accurate predictions. Additionally, a data-driven regression
mechanism which captures gradient information could improve the
localization of regions proposed by our framework.

Our framework can also be applied to other areas, such as medical
imaging, text detection and planetary science. Hand-engineering
proposal detectors is quite challenging, as we need to come up with
new similarity metrics, saliency and segmentation cues. However,
instead of designing score functions and features from scratch for
each new domain, learned deep convolutional features can be used,
after a network has been trained on a sufficiently large and
representative sample of related data. Finally, non-linear tree-based
classifiers like boosting or random forests can provide a framework
for fast inference, while avoiding the overhead of complete
propagation in deep neural networks and at the same time being
flexible to opt for a subset of actionable features for the task.